mkdir ucsc
cd ucsc
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./
This will download all the binaries into the “ucsc” directory.
After downloading the tools, you can use “gff3ToGenePred” to convert the GFF annotation
file into a GenePred file:
gff3ToGenePred \
GCF_009858895.2_ASM985889v3_genomic.gff \
SARSCOV2_refGene.txt
This will create “SARSCOV2_refGene.txt”, which is a GenePred file.
3. Use the ANNOVAR “retrieve_seq_from_fasta.pl” script to generate a transcript FASTA file
from the reference sequence:
retrieve_seq_from_fasta.pl \
--format refGene \
--seqfile GCF_009858895.2_ASM985889v3_genomic.fna \
SARSCOV2_refGene.txt \
--out SARSCOV2_refGeneMrna.fa
This will create a transcript FASTA file “SARSCOV2_refGeneMrna.fa”.
The gene-based database for SARS-CoV-2 variants is now ready to use. We will use it
in a later example.
4.3.3.2 ANNOVAR Input Files
The “annotate_variation.pl” script is the core ANNOVAR program for variant annotation.
The raw variants (SNVs or InDels) must be supplied to annotate_variation.pl in an ANNOVAR
input file. This is a plain text file with at least five space- or tab-delimited columns:
chromosome, start position, end position, reference nucleotides, and observed nucleotides.
Additional columns can follow. To see what such a file looks like, open the example
ANNOVAR input file “ex1.avinput” in the “example” directory, for instance with
“less -S ex1.avinput”.
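To make the five-column layout concrete, the following is a minimal sketch that writes a small ANNOVAR-style input file and checks its column count. The file name “demo.avinput”, the coordinates, and the alleles are hypothetical and chosen only for illustration; they are not taken from “ex1.avinput”. The sketch assumes tab-delimited columns and uses “-” for an empty allele, as ANNOVAR does for deletions.

```shell
# Hypothetical two-variant file; coordinates and alleles are made up
# purely to show the layout and do not describe real variants.
# Columns: chromosome, start, end, reference, observed, optional comment.
printf '1\t1000\t1000\tA\tG\tdemo SNV\n'        > demo.avinput
printf '1\t2000\t2001\tAT\t-\tdemo deletion\n' >> demo.avinput

# Report any line that has fewer than the five required columns.
awk -F'\t' '
    NF < 5 { printf "line %d has only %d columns\n", NR, NF; bad = 1 }
    END    { exit bad + 0 }
' demo.avinput && echo "all lines have at least five columns"

# Show only the five required columns, dropping the optional comment.
cut -f1-5 demo.avinput
```

The same awk check can be pointed at any tab-delimited input file before it is passed to annotate_variation.pl; a space-delimited file would need the field separator adjusted.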
Since variant callers produce different file formats, the “convert2annovar.pl” script
can be used to convert those files to the ANNOVAR input file format. The formats that
the script can convert include VCF, the samtools genotype-calling pileup format, the
Illumina export format from GenomeStudio, the SOLiD GFF genotype-calling format, and the
Complete Genomics variant format. To learn about the usage and options of the
“convert2annovar.pl” script, run the following:
convert2annovar.pl -h
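To illustrate what such a conversion produces, the sketch below maps the columns of a tiny VCF file onto the five ANNOVAR input columns using awk. It is not how convert2annovar.pl is implemented: it handles only single-nucleotide substitutions and ignores genotypes, multi-allelic sites, and InDels. The file names “demo.vcf” and “demo_snv.avinput” and the variant itself are hypothetical. For real data, the script itself should be used, typically as “convert2annovar.pl -format vcf4 input.vcf > output.avinput”.

```shell
# Hypothetical minimal VCF with one made-up single-nucleotide substitution.
printf '##fileformat=VCFv4.2\n'                           > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf 'NC_045512.2\t100\t.\tC\tT\t50\tPASS\t.\n'        >> demo.vcf

# Skip header lines. For a one-base substitution, start and end are both POS.
# VCF columns used: 1 = CHROM, 2 = POS, 4 = REF, 5 = ALT.
awk -F'\t' -v OFS='\t' '
    /^#/ { next }
    length($4) == 1 && length($5) == 1 { print $1, $2, $2, $4, $5 }
' demo.vcf > demo_snv.avinput

cat demo_snv.avinput
```

The output line has the chromosome, identical start and end positions, the reference base, and the observed base, which is the layout annotate_variation.pl expects for an SNV.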